Effectiveness of Covid-19 social distancing measures in Ontario
Enobong Udoh
- Project motivation and background
- Data collection
- Note: The baseline in mobility data is the median value, for the corresponding day of the week, during the 5-week period Jan 3–Feb 6, 2020
- Data understanding
- Importing project dependencies
- Utility functions
- Data preparation
- B. Ontario Confirmed Positive Cases with age groups
- C. Vaccine data with age groups
- D. Google Covid-19 Community mobility report for Ontario
- 3. Exploratory Analysis
- Ontario covid-19 Overview
- QUESTION 1:
- QUESTION 2:
- QUESTION 3:
- QUESTION 4:
- QUESTION 5:
- 4. Conclusion
- 5. Using Machine learning for Prediction
Project motivation and background
Covid-19 is an infectious respiratory disease caused by the newly discovered coronavirus SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2, formerly called 2019-nCoV), a member of the coronavirus family named for their spiky crown. The virus was first detected amid an outbreak of respiratory illness cases in Wuhan City, China, and was first reported to the World Health Organization on 31 December 2019.
In this work, exploratory analysis is carried out to assess the impact of Ontario's Covid-19 preventative solutions and restrictive (mobility) measures on the daily changes in Covid-19 cases.
In particular, this project will explore the following lines of inquiry with the help of a number of publicly accessible data sets:
- Is there an observable relationship between reported Covid-19 activity and the proposed medical solution, i.e. vaccination?
- Do people's activities across the days of the week influence the number of reported cases in Ontario?
- The government's vaccination plan gave preference first to adults aged 70 and over, as well as those considered medically compromised. Was this a result of a significant number of confirmed positive cases in the 70-and-above age group?
- How does the proportion of affected groups compare with those getting vaccinated?
- How has the pandemic impacted the community's mobility? Is there an observable effect on the number of cases in the province?
Data collection
The following datasets were identified to fulfill the analysis requirement:
- Ontario's Covid-19 Pandemic and Vaccination trends from 25-January-2020 to 17-July-2021
- Confirmed Positive Cases in cities within Ontario (with age)
- Ontario Vaccination data (by age)
- Google Covid-19 mobility report
Note: The baseline in mobility data is the median value, for the corresponding day of the week, during the 5-week period Jan 3–Feb 6, 2020
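The baseline definition can be sketched with toy numbers (the visit counts below are invented for illustration, not Google's actual data):

```python
import pandas as pd

# Hypothetical illustration of how the mobility baseline is defined:
# for each day of the week, the baseline is the median value over the
# 5-week window Jan 3 - Feb 6, 2020.
visits = pd.Series(
    range(35),  # 35 days from Jan 3 to Feb 6, 2020 inclusive
    index=pd.date_range("2020-01-03", "2020-02-06", freq="D"),
)

# median per day-of-week over the 5-week window
baseline = visits.groupby(visits.index.day_name()).median()

# percent change from baseline for a later observation, e.g. a Friday with 40 visits
friday_baseline = baseline["Friday"]
pct_change = (40 - friday_baseline) / friday_baseline * 100
```

Each mobility column in the Google dataset is such a percent change, so a value of 0 means activity matched the pre-pandemic baseline for that weekday.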
Data understanding
Features Explored In Ontario's Covid-19 Pandemic and Vaccination trends from 25-January-2020 to 17-July-2021:
- date - The date of activities captured in the dataset
- change_cases - The number of new cases as of each day
- change_fatalities - The number of new fatalities as of each day
- change_tests - The number of new tests as of each day
- change_hospitalizations - The number of new hospitalizations as of each day
- change_criticals - The number of new critical cases as of each day
- change_recoveries - The number of recovered patients as of each day
- change_vaccinations - The number of newly single-dosed vaccinated people as of each day
- change_vaccinated - The number of newly fully vaccinated people as of each day
- change_vaccines_distributed - The number of vaccines made available to the province as of each day
- total_cases - Total number of covid cases
- total_fatalities - Total number of covid-related fatalities
- total_tests - Total number of covid tests
- total_hospitalizations - Total number of covid-related hospitalizations
- total_criticals - Total number of covid-related critical care patients
- total_recoveries - Total number of recoveries
- total_vaccinations - Total number of covid vaccinations (first dose)
- total_vaccinated - Total number of fully vaccinated people
- total_vaccines_distributed - Total number of vaccines distributed across the province
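Per the dictionary above, each total_* column should be the running sum of its matching change_* column (plus anything counted before the series starts). A toy frame with invented values illustrates the sanity check one could run on the real dataset:

```python
import pandas as pd

# Invented daily counts; the cumulative column is derived from them.
toy = pd.DataFrame({"change_cases": [1, 0, 2, 5, 3]})
toy["total_cases"] = toy["change_cases"].cumsum()

# The check: every total_* value equals the cumulative sum of change_*.
assert (toy["total_cases"] == toy["change_cases"].cumsum()).all()
```

On the real data, a mismatch in this identity would flag a reporting correction or an ingestion error.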
NOTE: Other datasets used are accompanied by links to their dictionaries above.
Importing project dependencies
Required libraries are:
- pandas: required to access the dataset .csv files and work with data in tabular form.
- numpy: required to round the data in the correlation matrix.
- matplotlib, seaborn, pylab: required for data visualization.
- missingno: used to understand and visualize the presence and distribution of missing values in the data.
- datetime: used to work with time series data.
- pandas_profiling and pandas_profiling.utils.cache: used as a guiding tool to profile data.
- sklearn: used to access machine learning modules for prediction-related tasks.
! pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import missingno
import mpl_toolkits.mplot3d as m3d
import seaborn as sns
import matplotlib
from pylab import *
from pylab import rcParams
import pandas_profiling
from pandas_profiling import ProfileReport
from pandas_profiling.utils.cache import cache_file
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
print(f"numpy version: {np.__version__}")
print(f"pandas version: {pd.__version__}")
print(f"pandas profiling version: {pandas_profiling.__version__}")
def precent_na_in_cols(df):
    '''Print the percentage of missing data in each column.'''
    for col in df.columns:
        missing_share = df[col].isnull().mean()
        print(f"{col} - {missing_share:.1%}")
def dup_quick_search(df):
    '''Quick duplicate-row search for an entire DataFrame.'''
    if df.duplicated().any():
        print('There are some duplicates')
    else:
        print('There are no duplicates')
def non_num_dup_search(df):
    '''Duplicate search for non-numeric columns in a DataFrame.'''
    non_number_columns = list(df.select_dtypes(exclude=('int', 'float')).columns)
    print(f'Columns without numeric data: {", ".join(non_number_columns)}.')
    for column in non_number_columns:
        if df[column].duplicated().any():
            print(f'Column {column} contains duplicated values')
        else:
            print(f'Column {column} contains no duplicated values')
def singleCol_highest_search(df, col):
    '''Find the row(s) holding the highest value in a column, alongside the last column (date/day).'''
    last_col = df.columns[-1]
    print(f"The table below shows the date of the highest number of {col}\n\n")
    return df[df[col] == df[col].max()][[col, last_col]]
a. Reviewing the raw Ontario covid-19 cases and vaccine data
ontariocovid_vaccine_raw_df = pd.read_csv('OntarioDS.csv')
ontariocovid_vaccine_raw_df.head(2)
from google.colab import drive
drive.mount('/content/drive')
ontariocovid_vaccine_raw_df.columns
ontariocovid_vaccine_raw_df.dtypes
print(f"The size of the raw ontario covid and vaccine data is {ontariocovid_vaccine_raw_df.size}")
print(f"The shape of the raw ontario covid and vaccine data is {ontariocovid_vaccine_raw_df.shape}")
Observation:
- Column names are lengthy and can be simplified
- The date column is of the wrong data type
- Raw data contains some columns that are not needed for this analysis
a. Cleaning raw Ontario covid-19 cases and vaccine data
ontariocovid_vaccine_raw_df.drop(columns=['province','last_updated'], axis=1, inplace=True, errors='raise')
# reviewing data to see result without dropped columns
ontariocovid_vaccine_raw_df.head(3)
# every rename just strips the 'data » ' prefix from the column name
ontariocovid_vaccine_cl_df = ontariocovid_vaccine_raw_df.rename(columns=lambda c: c.replace('data » ', ''))
# reviewing table with renamed columns
ontariocovid_vaccine_cl_df.head(2)
print(f"shape: {ontariocovid_vaccine_cl_df.shape}")
print(f"size: {ontariocovid_vaccine_cl_df.size}")
ontariocovid_vaccine_cl_df.dtypes
ontariocovid_vaccine_cl_df['date'] = pd.to_datetime(ontariocovid_vaccine_cl_df['date']) # alternative method: OntarioDS['Date'].astype('datetime64')
ontariocovid_vaccine_cl_df.tail(2)
ontariocovid_vaccine_cl_df.dtypes
ontariocovid_vaccine_cl_df.info()
# data was double-checked by calculating the percentage of blanks and filled values for each column
precent_na_in_cols(ontariocovid_vaccine_cl_df)
missingno.matrix(ontariocovid_vaccine_cl_df,fontsize=16,figsize=(25,5),color=(0.29,0.5908,0.21)) #width and height in inches
plt.show()
Conclusion: The data has no missing values, so no further transformation is required
ontariocovid_vaccine_cl_df.describe()
fig_2 = plt.figure(figsize=(16,9))
gridspec.GridSpec(2,3)
plt.subplot2grid((2,3),(0,0))
# using a box plot to get a clearer view of possible outliers
# plt.figure(figsize=(5,5))
sns.boxplot(y='change_cases', data=ontariocovid_vaccine_cl_df, color='red')
plt.title("Covid cases distribution analysis")
plt.annotate("limit",(.21,10**3.54))
plt.yscale('log')
plt.ylabel("cases and tests scale")
plt.subplot2grid((2,3),(0,1))
sns.boxplot(y='change_tests', data=ontariocovid_vaccine_cl_df, color='blue')
plt.title("Covid tests distribution analysis")
plt.annotate("limit",(.21,10**4.875))
plt.yscale('log')
plt.ylabel("cases and tests scale")
plt.show()
fig_1 = plt.figure(figsize=(16,9))
gridspec.GridSpec(2,3)
plt.subplot2grid((2,2),(0,0))
# plt.figure(figsize=(5,5))
ontariocovid_vaccine_cl_df['change_cases'].plot()
plt.annotate("cases peak",(449,4812))
plt.legend()
plt.subplot2grid((2,2),(0,1))
# plt.figure(figsize=(5,5))
ontariocovid_vaccine_cl_df['change_tests'].plot()
plt.annotate("test peak",(358,76472),xycoords ='data')
# plt.annotate("test peak",(370,76472),xycoords ='data' ,arrowprops=dict(arrowstyle="->",color='black', shrink=0.0001, headwidth = 0.01,width=0.1))
plt.legend()
plt.show()
Conclusion:
- The data description indicates significant deviations from the mean, and the box plot shows possible outliers in the dataset
- The line plot shows considerable fluctuation in activity throughout the period.
- Cross-checking against CTV News coverage and Ontario's covid tracker, from which the dataset was extracted, confirms that the fluctuations are genuine and not data errors.
- No further transformation required
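The "possible outliers" flagged by the box plots follow the standard whisker rule: points outside [Q1 - 1.5·IQR, Q3 + 1.5·IQR]. A minimal sketch with invented values:

```python
import pandas as pd

# Invented series; the 300 mimics a case spike far above typical values.
s = pd.Series([10, 12, 11, 13, 12, 11, 300])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
# points beyond the whiskers, i.e. what the box plot draws as outliers
flagged = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
```

For a pandemic wave, such flagged points are usually real surges rather than data errors, which is why they are retained here.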
dup_quick_search(ontariocovid_vaccine_cl_df)
non_num_dup_search(ontariocovid_vaccine_cl_df)
Conclusion:
- While there is one non-numeric column, date, it contains no duplicated data, so no further transformation is required
ontariocovid_vaccine_ts_df = ontariocovid_vaccine_cl_df.set_index('date').tz_localize("Canada/Eastern")
ontariocovid_vaccine_ts_df.index.names =[None] # removing index column name
ontariocovid_vaccine_ts_df.head()
ontariocovid_vaccine_ts_df.info()
print(f'The Shape of the time series version of the data frame is: \t{ontariocovid_vaccine_ts_df.shape}')
print(f'The Size of the time series version of the data frame is: \t{ontariocovid_vaccine_ts_df.size}')
Conclusion:
- The Ontario covid and vaccine data has been cleaned, converted to a time series with the Canada/Eastern timezone, and is ready for processing.
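The timezone handling used throughout this notebook distinguishes two operations; a toy sketch (dates invented) shows the difference:

```python
import pandas as pd

# tz_localize attaches a timezone to naive timestamps;
# tz_convert translates already-aware timestamps to another zone.
idx = pd.date_range("2021-07-16", periods=2, freq="D")   # naive index
aware = idx.tz_localize("Canada/Eastern")                # now tz-aware
as_utc = aware.tz_convert("UTC")                         # same instants, shown in UTC
```

Calling tz_convert on a naive index raises an error, which is why the raw frames are localized first.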
PROFILING TEST
profile = ProfileReport(ontariocovid_vaccine_ts_df, title="Ontario_Covid_Vaccine", html={'style': {'full_width': True}}, sort=None)
profile.to_widgets()
# profile
Reviewing the raw Ontario confirmed cases data
confirmed_cases_city_raw_df = pd.read_csv('confirmed_positive_cases_of_COVID19_in_Ontario.csv')
confirmed_cases_city_raw_df.sort_values(by='Case_Reported_Date', inplace=True)
confirmed_cases_city_raw_df.head(10)
confirmed_cases_city_raw_df.columns
confirmed_cases_city_raw_df.dtypes
print(f"The size of the raw confirmed cases with age groups data is {confirmed_cases_city_raw_df.size}")
print(f"The shape of the raw confirmed cases with age groups data is {confirmed_cases_city_raw_df.shape}")
Observation:
- Data contains multiple columns that seem to be indices
- Data is captured from multiple cities and needs to be grouped by date, age_group and gender to make it cumulative Ontario data for this analysis
- Column names are capitalized and can be made lower case for consistency (not mandatory)
- Some columns are of the wrong data type
- Raw data contains some columns that are not needed for this analysis
- Records need to be adjusted to match the target EDA end date, July 17
# first - copy the city raw df for modification
ont_confirmed_cases_raw_df = confirmed_cases_city_raw_df.copy()
ont_confirmed_cases_raw_df['case_count'] =1
ont_confirmed_cases_raw_df.tail(2)
ont_confirmed_cases_raw_df.duplicated().any()
ont_confirmed_cases_raw_df.isnull().any()
for cols in ont_confirmed_cases_raw_df.columns:
missingvaluecheck = ont_confirmed_cases_raw_df[cols].isnull().mean()
print(f"{cols} - {missingvaluecheck :.1%}")
Observation:
- There are no duplicated records in the confirmed positive cases for the cities in Ontario.
- However, some columns have missing values. Since these columns are not used in the analysis and hold no crucial data, the grouping proceeds without further transformation
# defining the columns I need and resetting the index
ont_confirmed_cases_raw_df = ont_confirmed_cases_raw_df.groupby(['Case_Reported_Date','Age_Group','Client_Gender'])[['case_count']].sum().reset_index()
ont_confirmed_cases_raw_df.head()
ont_confirmed_cases_raw_df.columns =[ 'date','age_group','gender','case_count']
ont_confirmed_cases_raw_df.head(2)
# first: copy ontario confirmed cases into a new variable for editing
confirmed_cases_cl_df = ont_confirmed_cases_raw_df.copy()
confirmed_cases_cl_df.dtypes
confirmed_cases_cl_df['age_group']
confirmed_cases_cl_1_df = confirmed_cases_cl_df[~confirmed_cases_cl_df['age_group'].isin(['UNKNOWN'])].copy()  # .copy() avoids SettingWithCopyWarning on the assignments below
confirmed_cases_cl_1_df['age_group']
percent_of_retained_confirmed_cases = (len(confirmed_cases_cl_1_df['age_group'])/len(confirmed_cases_cl_df['age_group']))
print(f"The percentage of age data retained is: {percent_of_retained_confirmed_cases:.2%}")
confirmed_cases_cl_1_df['age_group'] = confirmed_cases_cl_1_df['age_group'].astype('category')
confirmed_cases_cl_1_df['gender'] = confirmed_cases_cl_1_df['gender'].astype('category')
confirmed_cases_cl_1_df['date'] = pd.to_datetime(confirmed_cases_cl_1_df['date'])
confirmed_cases_cl_1_df.dtypes
confirmed_cases_cl_2_df = confirmed_cases_cl_1_df[(confirmed_cases_cl_1_df['date']>='2020-01-23') & (confirmed_cases_cl_1_df['date']<='2021-07-17')]
confirmed_cases_cl_2_df.tail(2)
print(confirmed_cases_cl_2_df.size)
print(confirmed_cases_cl_2_df.shape)
precent_na_in_cols(confirmed_cases_cl_2_df)
confirmed_cases_cl_2_df.isna().any()
confirmed_cases_cl_2_df['age_group']
Conclusion:
- There is no missing data in the data set
confirmed_cases_cl_2_df.describe()
plt.figure(figsize=(5,5))
sns.boxplot(y='case_count', data=confirmed_cases_cl_2_df, color='green')
plt.yscale('log')
plt.title("Confirmed cases distribution analysis")
plt.annotate("maximum value",(.21,10**2.21))
plt.annotate("minimum value",(.21,10**0))
plt.ylabel("cases scale")
plt.show()
plt.figure(figsize=(5,5))
confirmed_cases_cl_2_df['case_count'].plot()
x = confirmed_cases_cl_2_df[(confirmed_cases_cl_2_df['case_count']==confirmed_cases_cl_2_df['case_count'].max())].index[0]
plt.annotate("559 cases",(x,confirmed_cases_cl_2_df['case_count'].max()))
plt.legend()
confirmed_cases_cl_2_df['case_count'].max()
Conclusion:
- The data description indicates substantial deviation from the mean, and the box plot shows possible outliers in the dataset
- The line plot shows considerable fluctuation in case counts throughout the period, which likely accounts for the deviation.
- No further transformation required
dup_quick_search(confirmed_cases_cl_2_df)
Conclusion:
- There are no duplicates in the data
# localizing the time to canadian timezone
confirmed_cases_ts_df = confirmed_cases_cl_2_df.set_index('date', drop=True).tz_localize('Canada/Eastern')
confirmed_cases_ts_df.tail(30)
print(f"confirmed cases final data shape: {confirmed_cases_ts_df.shape}")
print(f"confirmed cases final data size: {confirmed_cases_ts_df.size}")
profile_b = ProfileReport(confirmed_cases_city_raw_df, title="Confirmed_cases_by_age_Public_Health_data", html={'style': {'full_width': True}}, sort=None)
profile_b.to_widgets()
Conclusion
- The percentage of age data retained is: 99.14%
- The confirmed cases data has been cleaned, converted to a time series with the Canada/Eastern timezone, and is ready for processing.
Reviewing the raw Ontario vaccination data
vacc_data_raw_df = pd.read_csv('ontario_vaccination_data_by_age.csv')
vacc_data_raw_df.head(10)
vacc_data_raw_df.columns
vacc_data_raw_df.dtypes
print(f"The size of the raw vaccination with age groups data is {vacc_data_raw_df.size}")
print(f"The shape of the raw vaccination with age groups data is {vacc_data_raw_df.shape}")
Observation:
- Data contains multiple columns that seem to be indices
- Column names are capitalized and can be made lower case (not mandatory)
- Some columns are of the wrong data type
- Raw data contains some columns that are not needed for this analysis
- Records need to be adjusted to match the target EDA end date, July 17
- The age column contains an 'Undisclosed_or_missing' category, which does not seem to be an actual age category as there is no recorded population number for those people
- The age column contains 'Adults_18plus' and 'Ontario_12plus', which are just cumulative totals of the age ranges in those categories
vacc_data_cl_df = vacc_data_raw_df.set_index('_id', drop=True)
vacc_data_cl_df.index.names =[None]
vacc_data_cl_df.head(2)
vacc_data_cl_df.columns = ['date', 'age_group', 'partially_vaccinated', 'fully_vaccinated',
'total_population', '%_partially_vaccinated', '%_fully_vaccinated']
vacc_data_cl_df.head(2)
# vacc_data_cl_df.dtypes
vacc_data_cl_df['date'] = pd.to_datetime(vacc_data_cl_df['date'])
vacc_data_cl_df['age_group'] = vacc_data_cl_df['age_group'].astype('category')
vacc_data_cl_df.dtypes
vacc_data_cl_df.duplicated().any()
vacc_data_cl_df.isna().any()
vacc_data_cl_df.info()
vacc_data_cl_df[vacc_data_cl_df['%_partially_vaccinated'].isna()==True]
Observation
- There are 240 rows with missing data: the age groups of these vaccinated individuals were not provided and fall under the group 'Undisclosed_or_missing'.
- Hence, there is no population percentage for them.
vacc_data_cl_2_df = vacc_data_cl_df.copy()
vacc_data_cl_2_df.dropna(inplace=True)
vacc_data_cl_2_df.isna().any()
vacc_data_cl_3_df = vacc_data_cl_2_df[~vacc_data_cl_2_df['age_group'].isin(["Adults_18plus","Ontario_12plus"])]
vacc_data_cl_3_df.head(10)
vacc_data_cl_3_df = vacc_data_cl_3_df[vacc_data_cl_3_df['date']<='2021-07-17'].reset_index(drop=True)
vacc_data_cl_3_df.tail(2)
precent_na_in_cols(vacc_data_cl_3_df)
Conclusion
- There are no columns with missing data, hence all show 0.0% missing.
- No further transformation required
vacc_data_cl_3_df.describe()
plt.figure(figsize=(5,5))
vacc_data_cl_3_df['partially_vaccinated'].plot()
vacc_data_cl_3_df['fully_vaccinated'].plot()
plt.show()
Conclusion
- The trend shows consistent progression, so values that appear to be outliers are likely genuine.
- No transformation done
dup_quick_search(vacc_data_cl_3_df)
Conclusion
- No duplicates found and no further transformation has been carried out
vacc_data_ts_df = vacc_data_cl_3_df.set_index('date', drop=True).tz_localize('Canada/Eastern')
vacc_data_ts_df.head(10)
profile_c = ProfileReport(vacc_data_ts_df, title="Vaccination_by_age_Public_Health_data", html={'style': {'full_width': True}}, sort=None)
profile_c.to_widgets()
Conclusion
- The vaccination-by-age data has been cleaned, converted to a time series with the Canada/Eastern timezone, and is ready for processing.
d. Reviewing Ontario mobility data for 2020 and 2021
mobility_2020_raw_df =pd.read_csv('2020_ca_region_mobility_report.csv')
mobility_2020_raw_df.head()
mobility_2020_raw_df.columns
mobility_2020_raw_df.dtypes
print(f"The size of the raw 2020 mobility data is {mobility_2020_raw_df.size}")
print(f"The shape of the raw 2020 mobility data is {mobility_2020_raw_df.shape}")
mobility_2021_raw_df =pd.read_csv('2021_ca_region_mobility_report.csv')
mobility_2021_raw_df.tail()
mobility_2021_raw_df.columns
mobility_2021_raw_df.dtypes
print(f"The size of the raw 2021 mobility data is {mobility_2021_raw_df.size}")
print(f"The shape of the raw 2021 mobility data is {mobility_2021_raw_df.shape}")
Observation:
- Mobility data started being captured in February, while covid was first detected in the province in January
- Data covers multiple regions, but only Ontario would be used in this analysis
- Some columns are of the wrong data type
- Raw data contains some columns that are not needed for this analysis
- Records need to be adjusted to match the target EDA end date, July 17
Cleaning 2020
mobility_2020_cl_df = mobility_2020_raw_df.copy()
mobility_2020_cl_df = mobility_2020_cl_df[mobility_2020_raw_df['sub_region_1']=='Ontario']
mobility_2020_cl_df.head(3)
mobility_2020_cl_1_df = mobility_2020_cl_df.copy()
mobility_2020_cl_1_df = mobility_2020_cl_df[['date','sub_region_1','retail_and_recreation_percent_change_from_baseline', 'grocery_and_pharmacy_percent_change_from_baseline', 'parks_percent_change_from_baseline','transit_stations_percent_change_from_baseline', 'workplaces_percent_change_from_baseline', 'residential_percent_change_from_baseline']]
mobility_2020_cl_1_df.head(2)
mobility_2020_cl_1_df.dtypes
mobility_2020_cl_1_df['date'] = pd.to_datetime(mobility_2020_cl_1_df['date'])
mobility_2020_cl_1_df.dtypes
mobility_2020_cl_1_df.shape
Cleaning 2021
mobility_2021_cl_df = mobility_2021_raw_df[mobility_2021_raw_df['sub_region_1']=='Ontario']
mobility_2021_cl_df.tail(3)
mobility_2021_cl_1_df = mobility_2021_cl_df[['date','sub_region_1','retail_and_recreation_percent_change_from_baseline', 'grocery_and_pharmacy_percent_change_from_baseline', 'parks_percent_change_from_baseline','transit_stations_percent_change_from_baseline', 'workplaces_percent_change_from_baseline', 'residential_percent_change_from_baseline']]
mobility_2021_cl_1_df.head(2)
mobility_2021_cl_1_df.shape
mobility_2021_cl_1_df.dtypes
mobility_2021_cl_1_df['date'] = pd.to_datetime(mobility_2021_cl_1_df['date'])
mobility_2021_cl_1_df.head(2)
mobility_2021_cl_1_df.dtypes
precent_na_in_cols(mobility_2020_cl_1_df)
- There are multiple columns with missing data.
- Let's dig deeper into the location and other contents of the records with missing information
mobility_2020_cl_1_df[(mobility_2020_cl_1_df['retail_and_recreation_percent_change_from_baseline'].isna()==True)|
(mobility_2020_cl_1_df['grocery_and_pharmacy_percent_change_from_baseline'].isna()==True) |
(mobility_2020_cl_1_df['parks_percent_change_from_baseline'].isna()==True) |
(mobility_2020_cl_1_df['transit_stations_percent_change_from_baseline'].isna()==True) |
(mobility_2020_cl_1_df['workplaces_percent_change_from_baseline'].isna()==True) |
(mobility_2020_cl_1_df['residential_percent_change_from_baseline'].isna()==True)
]
mobility_2020_cl_1_df['date'].duplicated().any()
mobility_2020_cl_1_df[mobility_2020_cl_1_df['date']=='2020-02-16'].head(5)
# then, use average for dates for further processing
mobility_2020_cl_1_df.fillna(0, inplace=True)
precent_na_in_cols(mobility_2020_cl_1_df)
The 2020 mobility data no longer has missing values
dup_quick_search(mobility_2020_cl_1_df)
len(mobility_2020_cl_1_df[mobility_2020_cl_1_df.duplicated()==True])
# since unsure of data, group by date and take averages for duplicated dates
mobility_2020_grp_df = mobility_2020_cl_1_df.groupby('date').mean(numeric_only=True).reset_index()
mobility_2020_grp_df.head(2)
mobility_2020_grp_df.duplicated().any()
Checking for missing values in 2021
precent_na_in_cols(mobility_2021_cl_1_df)
mobility_2021_cl_1_df['date'].duplicated().any()
# then, use average for dates for further processing
mobility_2021_cl_1_df.fillna(0, inplace=True)
precent_na_in_cols(mobility_2021_cl_1_df)
The 2021 mobility data no longer has missing values
dup_quick_search(mobility_2021_cl_1_df)
len(mobility_2021_cl_1_df[mobility_2021_cl_1_df.duplicated()==True])
# since unsure of data, group by date and take averages for duplicated dates
mobility_2021_grp_df = mobility_2021_cl_1_df.groupby('date').mean(numeric_only=True).reset_index()
mobility_2021_grp_df.head(2)
mobility_2021_grp_df.duplicated().any()
2020
mobility_2020_grp_df.describe()
plt.figure(figsize=(14,13))
mobility_2020_grp_df['retail_and_recreation_percent_change_from_baseline'].plot()
mobility_2020_grp_df['grocery_and_pharmacy_percent_change_from_baseline'].plot()
mobility_2020_grp_df['parks_percent_change_from_baseline'].plot()
mobility_2020_grp_df['transit_stations_percent_change_from_baseline'].plot()
mobility_2020_grp_df['workplaces_percent_change_from_baseline'].plot()
mobility_2020_grp_df['residential_percent_change_from_baseline'].plot()
plt.legend(loc='upper right')
plt.show()
2021
mobility_2021_grp_df.describe()
plt.figure(figsize=(14,13))
mobility_2021_grp_df['retail_and_recreation_percent_change_from_baseline'].plot()
mobility_2021_grp_df['grocery_and_pharmacy_percent_change_from_baseline'].plot()
mobility_2021_grp_df['parks_percent_change_from_baseline'].plot()
mobility_2021_grp_df['transit_stations_percent_change_from_baseline'].plot()
mobility_2021_grp_df['workplaces_percent_change_from_baseline'].plot()
mobility_2021_grp_df['residential_percent_change_from_baseline'].plot()
plt.legend(loc='upper left')
plt.show()
A number of peaks can be observed in the data; however, given the public health restrictions during the pandemic and public caution, short-lived peaks during periods of eased restrictions are to be expected. The data is retained in its current state for analysis.
dup_quick_search(mobility_2020_grp_df)
len(mobility_2020_grp_df)
mobility_2020_grp_df['date'].unique()
dup_quick_search(mobility_2021_grp_df)
len(mobility_2021_grp_df)
mobility_2021_grp_df['date'].unique()
No duplicates found
mobility_2021_grp_sl_df = mobility_2021_grp_df[mobility_2021_grp_df['date']<='2021-07-17']
mobility_2021_grp_sl_df.tail(3)
mobility_2020_ts_df = mobility_2020_grp_df.set_index('date', drop=True).tz_localize('Canada/Eastern')
mobility_2020_ts_df.index.names=[None]
mobility_2020_ts_df.head(1)
print(f"Size of mobility ts data 2020: {mobility_2020_ts_df.size} ")
print(f"Shape of mobility ts data 2020: {mobility_2020_ts_df.shape} ")
print(len(mobility_2020_ts_df))
mobility_2021_ts_df = mobility_2021_grp_sl_df.set_index('date', drop=True).tz_localize('Canada/Eastern')
mobility_2021_ts_df.index.names =[None]
mobility_2021_ts_df.head(1)
print(f"The size of mobility_ts data 2021: {mobility_2021_ts_df.size} ")
print(f"The shape of mobility_ts data 2021: {mobility_2021_ts_df.shape} ")
mobility_ts_df = pd.concat([mobility_2020_ts_df, mobility_2021_ts_df])  # DataFrame.append is deprecated in newer pandas
mobility_ts_df.index = mobility_ts_df.index.tz_convert('Canada/Eastern')  # assign the result; tz_convert returns a new index
mobility_ts_df.head(2)
print(f"The size of the combined mobility_ts data: {mobility_ts_df.size} ")
print(f"The shape of the combined mobility_ts data: {mobility_ts_df.shape} ")
profile_d = ProfileReport(mobility_ts_df, title="Google Mobility Data for Ontario", html={'style': {'full_width': True}}, sort=None)
profile_d.to_widgets()
Conclusion
- It is important to note that almost all the columns selected for this analysis had null values; on further investigation, a large number of duplicated rows was observed in the data, and most were part of the records with NA values.
- To minimize data loss, the NA values were filled with 0, the data was grouped by date, and an average percent mobility value was computed for duplicated records.
- A number of peaks can be observed in the data; however, given the public health restrictions during the pandemic and public caution, short-lived peaks during periods of eased restrictions are to be expected. The data is retained in its current state for analysis.
- The points above apply to the data for both 2020 and 2021.
- The mobility data for 2020 and 2021 has now been cleaned, combined into a new df, converted to a time series with the Canada/Eastern timezone, and is ready for processing.
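The fill-then-average strategy described above can be sketched on a toy frame (dates and values invented):

```python
import pandas as pd

# Two rows share a date, one with a missing mobility value.
dup = pd.DataFrame({
    "date": ["2020-02-16", "2020-02-16", "2020-02-17"],
    "parks_percent_change_from_baseline": [10.0, None, -4.0],
})

# fill NA with 0, then average rows that share a date
dup = dup.fillna(0)
grp = dup.groupby("date").mean(numeric_only=True).reset_index()
# 2020-02-16 collapses to the average of 10.0 and 0.0, i.e. 5.0
```

Note that filling NA with 0 before averaging pulls the mean toward zero; dropping the NAs before taking the mean is an alternative worth weighing against the data-loss concern above.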
window=30
plt.figure(figsize=(13, 7))
plt.bar(ontariocovid_vaccine_ts_df.index, ontariocovid_vaccine_ts_df['total_cases'])
plt.plot(ontariocovid_vaccine_ts_df.index, ontariocovid_vaccine_ts_df['total_cases'].rolling(window).mean(), color='orange', linestyle='dashed')
plt.title('Total Cases', size=25)
plt.xlabel('Days Since 25-Jan-2020', size=18)
plt.ylabel('# of Cases', size=18)
plt.legend(['Total COVID-19 Cases', 'Moving Average {} Days'.format(window)], prop={'size': 16})
plt.xticks(size=15, rotation=45)
plt.yticks(size=15)
plt.show()
window=30
plt.figure(figsize=(13, 7))
plt.bar(ontariocovid_vaccine_ts_df.index, ontariocovid_vaccine_ts_df['total_tests'])
plt.plot(ontariocovid_vaccine_ts_df.index, ontariocovid_vaccine_ts_df['total_tests'].rolling(window).mean(), color='orange', linestyle='dashed')
plt.title('Total Tests', size=25)
plt.xlabel('Days Since 25-Jan-2020', size=18)
plt.ylabel('# of Tests', size=18)
plt.legend(['Total COVID-19 Tests', 'Moving Average {} Days'.format(window)], prop={'size': 16})
plt.xticks(size=15, rotation=45)
plt.yticks(size=15)
plt.show()
window=30
plt.figure(figsize=(13, 7))
plt.bar(ontariocovid_vaccine_ts_df.index, ontariocovid_vaccine_ts_df['total_hospitalizations'])
plt.plot(ontariocovid_vaccine_ts_df.index, ontariocovid_vaccine_ts_df['total_hospitalizations'].rolling(window).mean(), color='orange', linestyle='dashed')
plt.title('Total Hospitalizations', size=25)
plt.xlabel('Days Since 25-Jan-2020', size=18)
plt.ylabel('# of Hospitalizations', size=18)
plt.legend(['Total COVID-19 Hospitalizations', 'Moving Average {} Days'.format(window)], prop={'size': 16})
plt.xticks(size=15, rotation=45)
plt.yticks(size=15)
plt.show()
window=30
plt.figure(figsize=(13, 7))
plt.bar(ontariocovid_vaccine_ts_df.index, ontariocovid_vaccine_ts_df['total_recoveries'])
plt.plot(ontariocovid_vaccine_ts_df.index, ontariocovid_vaccine_ts_df['total_recoveries'].rolling(window).mean(), color='orange', linestyle='dashed')
plt.title('Total Recoveries', size=25)
plt.xlabel('Days Since 25-Jan-2020', size=18)
plt.ylabel('# of Recoveries', size=18)
plt.legend(['Total COVID-19 Recoveries', 'Moving Average {} Days'.format(window)], prop={'size': 16})
plt.xticks(size=15, rotation=45)
plt.yticks(size=15)
plt.show()
window=30
plt.figure(figsize=(13, 7))
plt.bar(ontariocovid_vaccine_ts_df.index, ontariocovid_vaccine_ts_df['total_fatalities'])
plt.plot(ontariocovid_vaccine_ts_df.index, ontariocovid_vaccine_ts_df['total_fatalities'].rolling(window).mean(), color='orange', linestyle='dashed')
plt.title('Total Fatalities', size=25)
plt.xlabel('Days Since 25-Jan-2020', size=18)
plt.ylabel('# of Fatalities', size=18)
plt.legend(['Total COVID-19 Fatalities', 'Moving Average {} Days'.format(window)], prop={'size': 16})
plt.xticks(size=15, rotation=45)
plt.yticks(size=15)
plt.show()
window=30
plt.figure(figsize=(13, 7))
plt.bar(ontariocovid_vaccine_ts_df.index, ontariocovid_vaccine_ts_df['total_vaccinations'])
plt.plot(ontariocovid_vaccine_ts_df.index, ontariocovid_vaccine_ts_df['total_vaccinations'].rolling(window).mean(), color='orange', linestyle='dashed')
plt.title('Total Partial Vaccinations', size=25)
plt.xlabel('Days Since 25-Jan-2020', size=18)
plt.ylabel('# of Partial Vaccinations', size=18)
plt.legend(['Moving Average {} Days'.format(window), 'Daily Changes in COVID-19 Partially Vaccinated People'], prop={'size': 16})
plt.xticks(size=15, rotation=45)
plt.yticks(size=15)
plt.show()
window=30
plt.figure(figsize=(13, 7))
plt.bar(ontariocovid_vaccine_ts_df.index, ontariocovid_vaccine_ts_df['total_vaccinated'])
plt.plot(ontariocovid_vaccine_ts_df.index, ontariocovid_vaccine_ts_df['total_vaccinated'].rolling(window).mean(), color='orange', linestyle='dashed')
plt.title('Total Full Vaccinations', size=25)
plt.xlabel('Days Since 25-Jan-2020', size=18)
plt.ylabel('# of vaccinations', size=18)
plt.legend(['Moving Average {} Days'.format(window), 'Daily Changes in COVID-19 Fully Vaccinated People'], prop={'size': 16})
plt.xticks(size=15, rotation=45)
plt.yticks(size=15)
plt.show()
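The six cells above repeat the same bar-plus-moving-average recipe for each column. A reusable helper, sketched here under the assumption that each plotted series carries the `ontariocovid_vaccine_ts_df` date index, would shorten them:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch also runs headless
import matplotlib.pyplot as plt
import pandas as pd

def plot_total_with_moving_average(series, title, ylabel, window=30):
    """Bar-plot a series with its rolling-mean overlay; return the moving average."""
    moving_avg = series.rolling(window).mean()
    plt.figure(figsize=(13, 7))
    plt.bar(series.index, series)
    plt.plot(series.index, moving_avg, color='orange', linestyle='dashed')
    plt.title(title, size=25)
    plt.xlabel('Days Since 25-Jan-2020', size=18)
    plt.ylabel(ylabel, size=18)
    plt.legend(['Moving Average {} Days'.format(window), title], prop={'size': 16})
    plt.xticks(size=15, rotation=45)
    plt.yticks(size=15)
    plt.show()
    return moving_avg
```

Usage would then be one line per chart, e.g. `plot_total_with_moving_average(ontariocovid_vaccine_ts_df['total_fatalities'], 'Total Fatalities', '# of Fatalities')`.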
ontariocovid_vaccine_corr_df = ontariocovid_vaccine_ts_df.corr()
ontariocovid_vaccine_corr_df.head(2)
plt.figure(figsize=(19.8,12))
sns.heatmap(ontariocovid_vaccine_corr_df, cmap='ocean', linewidths=2,vmax=1, vmin=0, square=True, annot=True)
plt.title("Assessing the levels of correlation between ontario covid activities")
plt.show()
g = sns.PairGrid(ontariocovid_vaccine_corr_df[['total_cases', 'total_fatalities', 'total_tests', 'total_hospitalizations', 'total_criticals', 'total_recoveries', 'total_vaccinations', 'total_vaccinated']])
g.map(sns.scatterplot, color ='olive')
plt.show()
Observation:
-
Since change_cases records the number of new cases per day, correlations involving the daily-change columns may be skewed, as these events take on new values every day.
-
A streamlined review of total cases against the other covid activities and preventive solutions is done next.
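The caution above can be illustrated on synthetic data: two unrelated daily series have near-zero correlation, but their running totals share a common upward trend and therefore correlate strongly. A minimal sketch (illustrative numbers only, not the project data):

```python
import numpy as np

rng = np.random.default_rng(0)
# two unrelated "daily change" series with a positive average level
daily_a = rng.normal(loc=10, scale=3, size=500)
daily_b = rng.normal(loc=10, scale=3, size=500)

daily_corr = np.corrcoef(daily_a, daily_b)[0, 1]                    # near zero
total_corr = np.corrcoef(daily_a.cumsum(), daily_b.cumsum())[0, 1]  # near one
```

This is why the totals and the daily changes are examined separately below.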
plt.figure(figsize=(13,7))
sns.heatmap(ontariocovid_vaccine_corr_df.loc['total_cases':,'total_cases':], annot=True, cmap='crest')
plt.title("Assessing the levels of correlation between 'Total' ontario covid activities")
plt.show()
Observation
-
The plot above shows that most of the correlations among the totals are positive, although the correlations between hospitalizations, criticals and the vaccination activities are low, close to zero.
-
From the plot above, it can be observed that there is a positive correlation between total cases and the total values of the other covid-related activities.
# create a data frame holding records from when Ontario first had partially vaccinated people
vaccine_activities_df= ontariocovid_vaccine_ts_df[ontariocovid_vaccine_ts_df['change_vaccinations']>0]
vaccine_activities_df.head(3)
plt.figure(figsize=(10,5))
sns.set_style("whitegrid")
sns.scatterplot(data=vaccine_activities_df, x='change_cases',y='change_vaccinations',hue='change_vaccinated',legend ='auto', alpha=0.8)
plt.title('Trend of Covid Cases in Ontario Post-Vaccination Activities', fontdict={'color':'purple','fontsize':16,'fontweight':'bold'})
plt.xlabel('Number of Covid Cases')
plt.ylabel('Number of vaccinations')
plt.show()
plt.figure(figsize=(19,11.5))
sns.heatmap(vaccine_activities_df.corr(), cmap='ocean', linewidths=2, vmax=1, vmin=0, square=True, annot=True)
plt.title("Assessing the levels of correlation between ontario covid activities - Post Vaccination")
plt.show()
Observation
-
The charts above show that the covid-related events and vaccination activities in the province are correlated.
-
Focusing on the correlation of total cases with every other element, in the data both pre- and post-vaccination, total cases has a positive correlation with the other totals, but its correlations with hospitalizations and criticals are the lowest.
day_vs_cases_df = ontariocovid_vaccine_ts_df.loc[:,'change_cases':'total_cases']
day_vs_cases_df.head()
day_vs_cases_df['day_of_week'] = day_vs_cases_df.index.dayofweek
day_vs_cases_df['day_name'] = day_vs_cases_df.index.day_name()
day_vs_cases_df.head()
# day_vs_cases_df.isna().any() # double-checking for any missing value: none found
sns.set(style='darkgrid')
plt.figure(figsize=(13,8))
sns.heatmap(day_vs_cases_df.corr(), annot=True, cmap='crest')
sns.set(style='darkgrid')
sns.set(palette='gist_earth')
fig = plt.figure(figsize=(25,80))
g =sns.FacetGrid(day_vs_cases_df, col='day_name')
g.map(sns.histplot, 'change_cases',kde=True, color='red')
plt.xlabel("Cumulative daily cases")
r = sns.FacetGrid(day_vs_cases_df, col='day_name')
r.map(sns.histplot, 'total_cases', kde=True, color='olive')
plt.xlabel("Cumulative total cases")
plt.show()
gridspec.GridSpec(1,2)
fig = plt.figure(figsize=(18,5.5))
sns.set_style("darkgrid")
plt.subplot2grid((1,2),(0,0))
sns.barplot(x=day_vs_cases_df['day_name'], y =day_vs_cases_df['change_cases'], palette='crest')
plt.xticks(rotation = 45)
plt.title("Cumulative daily cases by day of the week", fontdict={'fontsize':14,'fontweight':'bold'})
plt.ylabel("Cumulative daily cases")
plt.xlabel("Day of the week")
plt.subplot2grid((1,2),(0,1))
sns.barplot(x=day_vs_cases_df['day_name'],y=day_vs_cases_df['total_cases'], color = 'grey')
plt.xticks(rotation = 45)
plt.title("Cumulative total cases by day of the week", fontdict={'fontsize':14,'fontweight':'bold'})
plt.ylabel("Cumulative total cases")
plt.xlabel("Day of the week")
plt.show()
singleCol_highest_search(day_vs_cases_df, 'change_cases')
Observation:
-
The correlation heatmap shows that there is very low correlation between most covid activities and the days of the week.
-
Despite the low correlation, the plots reveal a trend in the cumulative daily changes in covid cases across the days of the week.
-
Based on the data, daily covid cases seem to rise from Thursday into the weekend, fluctuate between Saturday and Sunday, and make a U-shaped movement between Monday and Wednesday. Friday, April 16, is the day with the highest number of cases.
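The peak day reported above can also be located directly with `idxmax`; a small sketch using only the daily figures quoted in this observation:

```python
import pandas as pd

# daily case figures quoted in the observation, indexed by date
change_cases = pd.Series(
    [4249, 745, 2380, 4812],
    index=pd.to_datetime(['2021-01-08', '2021-02-02', '2021-03-25', '2021-04-16']),
)
peak_day = change_cases.idxmax()          # date of the maximum value
print(peak_day.date(), peak_day.day_name(), change_cases.max())
```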
confirmed_cases_ts_df.head(2)
confirmed_cases_ts_df.shape
confirmed_cases_age_grp_df = confirmed_cases_ts_df.copy()
confirmed_cases_age_grp_df.head(2)
confirmed_cases_age_grp_df.dtypes
confirmed_cases_age_grp_df['age_limit'] = confirmed_cases_age_grp_df['age_group'].copy()
confirmed_cases_age_grp_df['age_limit']
confirmed_cases_age_grp_df['age_limit'] = confirmed_cases_age_grp_df['age_limit'].str.replace('<20','19')
confirmed_cases_age_grp_df['age_limit'] = confirmed_cases_age_grp_df['age_limit'].str.replace('s','')
confirmed_cases_age_grp_df['age_limit'] = confirmed_cases_age_grp_df['age_limit'].str.replace('+','', regex=False)  # regex=False: '+' is a regex metacharacter
confirmed_cases_age_grp_df.tail(2)
confirmed_cases_age_grp_df.dtypes
confirmed_cases_age_grp_df['age_limit']= confirmed_cases_age_grp_df['age_limit'].astype('int')
# pd.cut needs len(bins) == len(labels) + 1, so start the edges at 0
cut_points = [0,19,20,30,40,50,60,70,80,90]
label_names = ['12-19','20-29','30-39','40-49','50-59','60-69','70-79','80-89','90+']
confirmed_cases_age_grp_df['age_category'] = pd.cut(confirmed_cases_age_grp_df['age_limit'], bins=cut_points, labels=label_names)
confirmed_cases_age_grp_df.tail(10)
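As the comment in the cell above notes, `pd.cut` requires one more bin edge than labels; each value falls into the half-open interval (left, right]. A minimal sketch with illustrative ages:

```python
import pandas as pd

ages = pd.Series([15, 23, 70, 91])
bins = [0, 19, 29, 69, 120]              # n + 1 edges
labels = ['<20', '20s', '30-69', '70+']  # n labels
categories = pd.cut(ages, bins=bins, labels=labels)
print(categories.tolist())
```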
confirmed_cases_age_grp_2020_df = confirmed_cases_age_grp_df[confirmed_cases_age_grp_df.index.year ==2020]
confirmed_cases_age_grp_2020_df.tail()
confirmed_cases_age_grp_2021_df = confirmed_cases_age_grp_df[confirmed_cases_age_grp_df.index.year ==2021]
confirmed_cases_age_grp_2021_df.tail()
age_cases_2020_grouped_df = confirmed_cases_age_grp_2020_df.groupby('age_category')[['case_count']].sum()
age_cases_2020_grouped_df.head(10)
age_cases_2021_grouped_df = confirmed_cases_age_grp_2021_df.groupby('age_category')[['case_count']].sum()
age_cases_2021_grouped_df.head(10)
age_cases_both_grouped_df = confirmed_cases_age_grp_df.groupby('age_category')[['case_count']].sum()
age_cases_both_grouped_df.head(10)
gridspec.GridSpec(1,3)
label = ['Under 20','20s','30s','40s','50s','60s','70s','80s','90 and over']
color = ['#1D2F6F', '#8390FA', '#6EAF46', '#FAC748','#2FAE9F','#D5AE9F','#D52A70','#552A8A','#D4C48A']
fig = plt.figure(figsize=(35,40))
plt.subplot2grid((1,3),(0,0))
plt.pie(x='case_count', labels=label, data = age_cases_2020_grouped_df, textprops= {'fontsize':16,'fontweight':'bold'}, rotatelabels=30, autopct='%1.1f%%', shadow=True, colors=color)
plt.title("2020 Cases by Age groups", fontdict={'fontsize':30,'fontweight':'bold','color':'darkblue'})
plt.subplot2grid((1,3),(0,1))
plt.pie(x='case_count', labels=label, data = age_cases_2021_grouped_df, textprops= {'fontsize':16,'fontweight':'bold'}, rotatelabels=20, autopct='%1.1f%%', shadow=True, colors=color)
plt.title("2021 Cases by Age groups", fontdict={'fontsize':30,'fontweight':'bold','color':'darkblue'})
plt.subplot2grid((1,3),(0,2))
plt.pie(x='case_count', labels=label, data = age_cases_both_grouped_df, textprops= {'fontsize':16,'fontweight':'bold'}, rotatelabels=45, autopct='%1.1f%%', shadow=True, colors=color)
plt.title("Both years combined", fontdict={'fontsize':30,'fontweight':'bold','color':'darkblue'})
plt.legend(loc='upper right', ncol=3)
plt.show()
age_cat_frq_df_2020 = confirmed_cases_age_grp_2020_df['age_category'].value_counts().sort_index().to_frame()
age_cat_frq_df_2020
age_cat_frq_df_2020['normalized_freq'] = confirmed_cases_age_grp_2020_df['age_category'].value_counts(normalize=True)*100
age_cat_frq_df_2020['cummulative_freq'] = age_cat_frq_df_2020['normalized_freq'].cumsum()
age_cat_frq_df_2020
age_cat_frq_df_2021 = confirmed_cases_age_grp_2021_df['age_category'].value_counts().sort_index().to_frame()
age_cat_frq_df_2021
age_cat_frq_df_2021['normalized_freq'] = confirmed_cases_age_grp_2021_df['age_category'].value_counts(normalize=True)*100
age_cat_frq_df_2021['cummulative_freq'] = age_cat_frq_df_2021['normalized_freq'].cumsum()
age_cat_frq_df_2021
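The frequency-table construction above boils down to `value_counts(normalize=True)` plus a running `cumsum`; a toy sketch (the four-category series stands in for `age_category`) makes the mechanics explicit:

```python
import pandas as pd

# toy age categories standing in for confirmed_cases_age_grp_2020_df['age_category']
cats = pd.Series(['20s', '20s', '30s', '70+'])
freq = cats.value_counts().sort_index().to_frame(name='count')
freq['normalized_freq'] = cats.value_counts(normalize=True).sort_index() * 100
freq['cummulative_freq'] = freq['normalized_freq'].cumsum()
print(freq)
```

The last cumulative value is always 100%, which is what makes the cumulative-frequency plots further down readable as "share of cases under age X".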
label = ['Under 20','20s','30s','40s','50s','60s','70s','80s','90 and over']
width= 0.45
err_2020 = age_cat_frq_df_2020['normalized_freq'].max()
err_2021 = age_cat_frq_df_2021['normalized_freq'].max()
pltx=0
fig, ax = plt.subplots(1, figsize=(13,6))
ax.bar(x=label, height=age_cat_frq_df_2020['normalized_freq'], width=width, label='2020')
ax.bar(x=label, height=age_cat_frq_df_2021['normalized_freq'], width=width, bottom=age_cat_frq_df_2020['normalized_freq'], label='2021')
ax.set_ylabel('Frequency')
ax.set_title('Confirmed Cases Distribution by age groups')
ax.legend(loc='upper right', ncol = 2)
plt.annotate("23.64",(38,297.24), xycoords='axes points')
plt.annotate("24.86",(115,312), xycoords='axes points')
plt.annotate("24.49",(190,307), xycoords='axes points')
plt.show()
CONCLUSION:
- The age groups of 99.14% of positive cases were accurately provided and used in this analysis.
- Based on that data, it can be observed that young adults in their 20s were the population with the highest number of positive covid cases in Ontario.
gridspec.GridSpec(1,2)
fig = plt.figure(figsize=(15,6))
plt.subplot2grid((1,2),(0,0))
age_cat_frq_df_2020['cummulative_freq'].plot(color='r')
plt.title("Frequency of Cases by Age: 2020", fontdict={'fontsize':20,'fontweight':'bold','color':'darkblue'})
plt.annotate("population under 70 years: 70.58%", (231.79,209.2), xycoords='axes points')
plt.legend(loc='upper left')
plt.grid()
plt.subplot2grid((1,2),(0,1))
age_cat_frq_df_2021['cummulative_freq'].plot()
plt.title("Frequency of Cases by Age: 2021", fontdict={'fontsize':20,'fontweight':'bold','color':'darkblue'})
plt.annotate("population under 70 years: 72.25%", (231.98,216.5), xycoords='axes points')
plt.legend(loc='upper left')
plt.grid()
plt.show()
CONCLUSION:
- Based on the data, in 2020 about 70.58% of the people affected by covid were under 70.
- In 2021, approximately 72.25% of the people affected by covid were under 70 years of age.
# create an age bin in the vaccination data for those under 19 - so it matches the cases data
vacc_data_ts_df.head(10)
vacc_data_ts_processing_df = vacc_data_ts_df.copy()
vacc_data_ts_processing_df.head(10)
vacc_data_ts_grp_df = vacc_data_ts_processing_df.groupby('age_group')[vacc_data_ts_processing_df.columns].sum()
vacc_data_ts_grp_df.head(10)
vacc_data_ts_grp_df.drop(index=['Adults_18plus','Ontario_12plus','Undisclosed_or_missing'],inplace=True)
vacc_data_ts_grp_df['%_of_partial_across_groups'] = vacc_data_ts_grp_df['partially_vaccinated']/vacc_data_ts_grp_df['partially_vaccinated'].sum()*100
vacc_data_ts_grp_df['%_of_full_across_groups'] = vacc_data_ts_grp_df['fully_vaccinated']/vacc_data_ts_grp_df['fully_vaccinated'].sum() * 100
vacc_data_ts_grp_df.head(3)
gridspec.GridSpec(1,3)
label_cases = ['Under 20','20s','30s','40s','50s','60s','70s','80s','90 and over']
label_vacc = ['12-17yrs','18-29yrs','30s','40s','50s','60s','70s','80+']
color = ['#1D2F6F', '#8390FA', '#6EAF46', '#FAC748','#2FAE9F','#D5AE9F','#D52A70','#552A8A','#D4C48A']
fig = plt.figure(figsize=(35,30))
plt.subplot2grid((1,3),(0,0))
plt.pie(x='partially_vaccinated', labels=label_vacc, data = vacc_data_ts_grp_df, textprops= {'fontsize':16,'fontweight':'bold'}, rotatelabels=35, autopct='%1.1f%%', shadow=True)
plt.title("Partial Vaccination", fontdict={'fontsize':26,'fontweight':'bold','color':'darkblue'})
plt.legend(loc='upper right', ncol=2)
plt.subplot2grid((1,3),(0,1))
plt.pie(x='fully_vaccinated', labels=label_vacc, data = vacc_data_ts_grp_df, textprops= {'fontsize':16,'fontweight':'bold'}, rotatelabels=45, autopct='%1.1f%%', shadow=True)
plt.title("Full Vaccination", fontdict={'fontsize':26,'fontweight':'bold','color':'darkblue'})
plt.legend(loc='upper right', ncol=2)
plt.subplot2grid((1,3),(0,2))
plt.pie(x='case_count', labels=label_cases, data = age_cases_both_grouped_df, textprops= {'fontsize':16,'fontweight':'bold'}, rotatelabels=45, autopct='%1.1f%%', shadow=True, colors=color)
plt.title("Covid Cases throughout covid-19", fontdict={'fontsize':26,'fontweight':'bold','color':'darkblue'})
plt.legend(loc='upper right', ncol=3)
plt.show()
Conclusion
While cases are higher among the younger population, vaccination efforts have a wider spread among the older population. This could potentially slow down the province's rate of overcoming the pandemic.
# uses cases on a smaller scale like per 100 or per 1000
ontariocovid_vaccine_processing_df = ontariocovid_vaccine_ts_df.copy()
ontariocovid_vaccine_processing_df.head(2)
ontariocovid_vaccine_processing_df['cases_moving_average'] = ontariocovid_vaccine_processing_df['change_cases'].rolling(window=30).mean()
ontariocovid_vaccine_processing_df.head(2)
mobility_processing_df = mobility_ts_df.copy()
mobility_processing_df.head(2)
# focus is given more to 2021 as mobility data is for 2021
gridspec.GridSpec(1,3)
fig = plt.figure(figsize=(28,7))
plt.subplot2grid((1,3),(0,0))
ontariocovid_vaccine_processing_df['change_cases'].plot(color='darkred', label='Positive Cases Publicly Reported')
ontariocovid_vaccine_processing_df['cases_moving_average'].plot(color='darkgray', linestyle='dashed', label=' 30 Day Moving average', linewidth=2)
plt.annotate("08-01-21: 4249 cases", (250,325), xycoords='axes pixels', size=13)
plt.annotate("16-04-21: 4812 cases", (350,365.89), xycoords='axes pixels', size=13)
plt.annotate("02-02-21: 745 cases", (290,63.89), xycoords='axes pixels', size=13)
plt.ylabel('Number of cases')
plt.xlabel('Months of the Year')
plt.grid(axis='both')
plt.legend()
plt.subplot2grid((1,3),(0,1))
mobility_processing_df['retail_and_recreation_percent_change_from_baseline'].plot(color='#1D2F6F', label ='Retail and Recreation')
mobility_processing_df['retail_and_recreation_percent_change_from_baseline'].rolling(window=30).mean().plot(color='darkgray', linestyle='dashed',label=' 30 Day Moving average', linewidth=2)
plt.annotate("12-04-20: -74.68%",(20.67,11.99), xycoords='axes pixels', size=13)
plt.annotate("23-12-20: 11.37%", (244,321.92), xycoords='axes pixels', size=13)
plt.annotate("25-12-20: -75.63%",(200.67,10.99), xycoords='axes pixels', size=13)
plt.annotate("01-04-21: 13.46%", (374.4,365.89), xycoords='axes pixels', size=13)
plt.ylabel('Percentage Change of Activities')
plt.xlabel('Months of the Year')
plt.title("Retail and Recreation Activities")
plt.grid(axis='both')
plt.legend()
plt.subplot2grid((1,3),(0,2))
mobility_processing_df['grocery_and_pharmacy_percent_change_from_baseline'].plot(color='#6EAF46', label='Grocery and Pharmacy')
mobility_processing_df['grocery_and_pharmacy_percent_change_from_baseline'].rolling(window=30).mean().plot(color='darkgray', linestyle='dashed',label=' 30 Day Moving average', linewidth=2)
plt.ylabel('Percentage Change of Activities')
plt.xlabel('Months of the Year')
plt.title("Grocery and Pharmacy")
plt.grid(axis='both')
plt.legend()
plt.show()
# color = ['#1D2F6F', '#8390FA', '#6EAF46', '#FAC748','#2FAE9F','#D5AE9F','#D52A70','#552A8A','#D4C48A', '#F4DABD']
Observation
-
Although Ontario publicly reported its first case on January 25th, the rate of progression became more obvious in March, and the province recorded its first death in the same month.
-
Ontario experienced its first major daily-case peak on the 8th of January 2021, when 4249 people were reported to have tested positive. Cases then declined gradually, dropping to as low as 745 positive cases on 2021-02-02. Although there were fluctuations, daily cases stayed under 2000 until 2021-03-25 (2380 cases), then climbed gradually to the next peak on the 16th of April 2021 (4812 cases) - based on the data collected as of 17th July, 2021.
-
Retail and recreation has spent more time below the baseline than above it in 2021. This can possibly be attributed to the stay-at-home orders and state-of-emergency declarations that limited people from moving freely and businesses from staying open.
-
Although groceries and pharmacies were exempt from most operational restrictions such as closure, there have been some sudden peaks and dips, but movement has stayed fairly consistent around the baseline, within roughly ±23%.
ADDITIONAL NOTE ON TIMELINES
-
Mar. 17, 2020: a state of emergency was declared and non-essential movements were regulated.
-
May 11, 2020: residents were allowed to walk, hike, bike and bird watch in provincial parks. Camping and access to beaches remained closed.
-
May 16, 2020: Some businesses open: including campgrounds, marinas and golf courses.
-
Jul 31, 2020: Province was open again.
-
Sep. 8, 2020: pause on loosening any more restrictions.
-
Sept. 28, 2020: Restrictions started getting tightened and it was announced that the province was officially in the second wave of the pandemic.
-
Sep. 30, 2020: the province could see upwards of 1,000 cases a day in October, as the second wave is in full swing (Health officials).
-
Oct. 9, 2020: closure of indoor activities
-
Feb. 16 and 19, 2021: first set of provinces go out of lockdown except Toronto, Peel Region and North Bay-Parry Sound - for 2 weeks more (Mar. 2).
-
Apr. 7, 2021: The Ford government declares the province’s third state of emergency amid the COVID-19 pandemic and is issuing a provincewide stay-at-home order. The order will last for four weeks.
-
June 11, latest reopening
gridspec.GridSpec(1,3)
fig = plt.figure(figsize=(28,7))
plt.subplot2grid((1,3),(0,0))
ontariocovid_vaccine_processing_df['change_cases'].plot(color='darkred', label='Positive Cases Publicly Reported')
ontariocovid_vaccine_processing_df['cases_moving_average'].plot(color='darkgray', linestyle='dashed', label=' 30 Day Moving average', linewidth=2)
plt.annotate("08-01-21: 4249 cases", (250,325), xycoords='axes pixels', size=13)
plt.annotate("16-04-21: 4812 cases", (350,365.89), xycoords='axes pixels', size=13)
plt.annotate("02-02-21: 745 cases", (290,63.89), xycoords='axes pixels', size=13)
plt.ylabel('Number of cases')
plt.xlabel('Months of the Year')
plt.grid(axis='both')
plt.legend()
plt.subplot2grid((1,3),(0,1))
mobility_processing_df['parks_percent_change_from_baseline'].plot(color='#284C5D', label ='Parks')
mobility_processing_df['parks_percent_change_from_baseline'].rolling(window=30).mean().plot(color='darkgray', linestyle='dashed',label=' 30 Day Moving average', linewidth=2)
plt.annotate("Highest Positive Change from Baseline", (260.4,365.89), xycoords='axes pixels')
plt.ylabel('Percentage Change of Activities')
plt.xlabel('Months of the Year')
plt.title("Parks")
plt.grid(axis='both')
plt.legend(loc='upper left')
plt.subplot2grid((1,3),(0,2))
mobility_processing_df['transit_stations_percent_change_from_baseline'].plot(color='#552A8A', label='Transit Stations')
mobility_processing_df['transit_stations_percent_change_from_baseline'].rolling(window=30).mean().plot(color='darkgray', linestyle='dashed',label=' 30 Day Moving average', linewidth=2)
plt.annotate("10-04-20: -42.85%", (10.67,17.43), xycoords='axes pixels',size=13)
plt.annotate("01-08-20: 7.36%", (120,319.99), xycoords='axes pixels', size=13)
plt.annotate("01-01-21: -43.63%", (232.5,13.88), xycoords='axes pixels', size=13)
plt.ylabel('Percentage Change of Activities')
plt.xlabel('Months of the Year')
plt.title("Transit Stations")
plt.grid(axis='both')
plt.legend()
plt.show()
#
# color = ['#1D2F6F', '#8390FA', '#6EAF46', '#FAC748','#2FAE9F','#D5AE9F','#D52A70','#552A8A','#D4C48A', '#F4DABD']
Observation
-
Movements to parks experienced a decline early in the year, which is typically expected given the weather conditions at the time. However, more drops below the baseline occurred in March, leading into early April, after which movement rose above the baseline as restrictions eased in the province. A similar fluctuating pattern can be observed throughout the period being analyzed.
-
Transit stations have seen fewer people moving through them. Although a lot of fluctuation has been observed, movement has consistently stayed below the baseline.
gridspec.GridSpec(1,3)
fig = plt.figure(figsize=(28,7))
plt.subplot2grid((1,3),(0,0))
ontariocovid_vaccine_processing_df['change_cases'].plot(color='darkred', label='Positive Cases Publicly Reported')
ontariocovid_vaccine_processing_df['cases_moving_average'].plot(color='darkgray', linestyle='dashed', label=' 30 Day Moving average', linewidth=2)
plt.annotate("08-01-21: 4249 cases", (250,325), xycoords='axes pixels', size=13)
plt.annotate("16-04-21: 4812 cases", (350,365.89), xycoords='axes pixels', size=13)
plt.annotate("02-02-21: 745 cases", (290,63.89), xycoords='axes pixels', size=13)
plt.ylabel('Number of cases')
plt.xlabel('Months of the Year')
plt.grid(axis='both')
plt.legend()
plt.subplot2grid((1,3),(0,1))
mobility_processing_df['workplaces_percent_change_from_baseline'].plot(color='#287EB2', label ='Work Places')
mobility_processing_df['workplaces_percent_change_from_baseline'].rolling(window=30).mean().plot(color='darkgray', linestyle='dashed',label=' 30 Day Moving average', linewidth=2)
plt.ylabel('Percentage Change of Activities')
plt.xlabel('Months of the Year')
plt.title("Work Places")
plt.grid(axis='both')
plt.legend()
plt.subplot2grid((1,3),(0,2))
mobility_processing_df['residential_percent_change_from_baseline'].plot(color='#E4C071', label='Residential')
mobility_processing_df['residential_percent_change_from_baseline'].rolling(window=30).mean().plot(color='darkgray', linestyle='dashed',label=' 30 Day Moving average', linewidth=2)
plt.ylabel('Percentage Change of Activities')
plt.xlabel('Months of the Year')
plt.title("Residential")
plt.grid(axis='both')
plt.legend()
plt.show()
#
# color = ['#1D2F6F', '#8390FA', '#6EAF46', '#FAC748','#2FAE9F','#D5AE9F','#D52A70','#552A8A','#D4C48A', '#F4DABD']
Observation
-
Although movement to workplaces was slightly above the baseline in early March, a state of emergency was declared in the province on March 17th and its effects can be observed as movements to workplaces dropped. Workplace-related movement has remained under the baseline throughout the pandemic. This can be attributed to the prolonged restrictions and the fact that most workers who can fulfill their job responsibilities from home are working remotely.
-
Residential movement has stayed above the baseline almost throughout the pandemic, with only a few occasional drops below it.
Conclusion
- Despite activities slowing down and the preventative measures adopted by the government, the number of cases in the province continued to rise and saw its two major peaks during the government-imposed stay-at-home orders.
4. Conclusion
- The data shows that there is correlation between covid activities and the preventive solution - vaccination. Although the levels of correlation differ, total cases has a positive correlation with the totals of the other activities.
- Total cases vs Total fatalities has a correlation of ~0.96
- Total cases vs Total tests has a correlation of ~0.98
- Total cases vs Total hospitalizations has a correlation of ~0.50
- Total cases vs Total criticals has a correlation of ~0.79
- Total cases vs Total recoveries has a correlation of ~0.99
- Total cases vs Partial vaccinations has a correlation of ~0.83
- Total cases vs Full vaccinations has a correlation of ~0.60
- Total cases vs Vaccines distributed has a correlation of ~0.83
-
While there is very low correlation between the days of the week and total cases in Ontario, a bar plot shows that the daily number of cases tends to differ across the days of the week. Daily changes in covid cases seem to rise from Thursday into the weekend, fluctuate between Saturday and Sunday, and make a U-shaped movement between Monday and Wednesday. From the data, Friday, April 16 2021, is the day with the highest number of cases.
-
Furthermore, although the older population in Ontario is said to have a higher risk of contracting the virus, the data shows a higher number of positive tests among young adults in their 20s and 30s. Findings are based on only the 99.14% of records with usable age groups, as some rows were lost during data cleaning.
-
Despite cases being higher among the younger population, as of July 17, 2021, preventive (vaccination) efforts had a wider spread among the older population. If events progress at this rate, it will likely slow down the speed with which the province overcomes the pandemic.
-
Additionally, irrespective of activities slowing down and the preventative measures adopted by the government, such as full lockdowns and restricted movements, the number of cases in the province continued to rise. It can also be noted that the daily change in cases saw its two major peaks during the government-imposed stay-at-home orders.
Recommendations:
In the event of future pandemics, to overcome their impact faster, it is recommended that Ontario:
- Expand vaccination opportunities to include the younger demographic, as this can potentially reduce the number of cases and the prolonged spread in the province.
- Continue large-scale public education on hygiene measures such as washing hands, wearing masks and sanitizing shared spaces, to minimize each individual's chances of contracting the virus.
- Analyse the impact of mobility restriction measures periodically to determine how viable that solution is. If cases tend to increase drastically at the end of lockdowns, it might be due to asymptomatic carriers suddenly mixing with others whenever some degree of freedom is allowed.
- Explore limiting capacity as opposed to full lockdowns during a pandemic. This would likely reduce the sudden rush for everyone to be outside at the same time and would increase the possibility of knowing who was where and when, e.g. via the barcode registrations some enclosed spaces presently require.
Null hypothesis: there is no correlation between the features and the target variable in the dataset, i.e. the correlation coefficients between the features and the target variable are zero.
Alternative hypothesis: there is a linear correlation of 0.75 or higher between the features and the prediction target in the data.
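One way to evaluate the null hypothesis is to convert a sample correlation r into a t-statistic, t = r·sqrt((n−2)/(1−r²)), and compare it against the two-sided 5% critical value (about 1.98 for large n). A numpy-only sketch on illustrative data (the stand-in series are not the project data):

```python
import numpy as np

def corr_t_stat(x, y):
    """Pearson r and its t-statistic for testing the null hypothesis r == 0."""
    r = np.corrcoef(x, y)[0, 1]
    n = len(x)
    t = r * np.sqrt((n - 2) / (1 - r ** 2))
    return r, t

rng = np.random.default_rng(1)
x = np.arange(100, dtype=float)               # stand-in for total_tests
y = 3 * x + rng.normal(scale=5.0, size=100)   # stand-in for total_cases
r, t = corr_t_stat(x, y)
# reject the null and keep the feature when |t| is large and r >= 0.75
```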
# first - get the columns where correlation > 0.75
def get_cols_that_meet_corr_limit(df, map_column, corr_limit):
    '''
    Return a list of columns that meet or exceed the specified correlation
    limit for linear regression.
    df: main data frame
    map_column: this column is used to check how the other columns correlate with it
    corr_limit: the minimum value to which a column should correlate with the map_column
    '''
    collist = []
    for col in df.columns:
        corr = df[map_column].corr(df[col], method='pearson')
        if corr >= corr_limit:
            print(f"{col}: {corr}")
            collist.append(col)
    print("\n\nYou can apply the above columns to your df using 'collist'. \nSyntax: df[collist]")
    return collist
collist = get_cols_that_meet_corr_limit(ontariocovid_vaccine_ts_df, 'total_cases', 0.75)
tc_prediction_df = ontariocovid_vaccine_ts_df[collist]
tc_prediction_df.head(3)
features = tc_prediction_df.drop(columns=['total_fatalities','total_recoveries','total_cases','total_criticals','total_vaccinations','total_vaccines_distibuted'])
features.head()
target_prediction = tc_prediction_df['total_cases']
print(len(features))
print(len(target_prediction))
# put Xs together then Ys together - so that the split maps to features and target_pred. correctly
X_train, X_test, y_train, y_test = train_test_split(features, target_prediction, test_size=0.30, random_state=0)
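The comment above is handled by train_test_split itself: it shuffles X and y with the same permutation, so rows stay paired. A quick sketch with toy data, where y is exactly twice X so the pairing is easy to verify:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape(-1, 1)
y = 2 * np.arange(10)  # y == 2 * X for every row
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0)
# every split row still satisfies y == 2 * X, so features and target stay aligned
```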
model_tc = LinearRegression()
model_tc.fit(X_train,y_train)
print(f"The y_intercept of the model (beta_0) is: {model_tc.intercept_:.5f}")
print(f"The slopes of the model (beta_1) are: \n{model_tc.coef_[0]:.5f}: for changes_vaccination \n{model_tc.coef_[1]:.5f}: for total tests")
print(f"The model score on the training data is: {model_tc.score(X_train, y_train):.3f}")
print(len(X_test))
print(len(y_test))
lin_model_pred_df = X_test.copy()
lin_model_pred_df['total_cases'] = y_test
lin_model_pred_df['predicted_total_cases'] = np.round(model_tc.predict(X_test)).astype(int)
lin_model_pred_df.iloc[120:130]
print(f"The measure of accuracy for the model using r-squared is: {r2_score(lin_model_pred_df['total_cases'], lin_model_pred_df['predicted_total_cases']):.4f}")
print(f"The mean absolute error for the model is: {mean_absolute_error(lin_model_pred_df['total_cases'], lin_model_pred_df['predicted_total_cases']):.2f}")
print(f"The mean squared error for the model is: {mean_squared_error(lin_model_pred_df['total_cases'], lin_model_pred_df['predicted_total_cases']):.2f}")
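For reference, the R-squared reported above is 1 − SS_res/SS_tot; a hand-rolled version (a sketch, equivalent in result to sklearn's `r2_score` for this use) makes the metric explicit:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = ((y_true - y_pred) ** 2).sum()          # residual sum of squares
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()   # total sum of squares
    return 1.0 - ss_res / ss_tot
```

A perfect prediction gives 1.0; predicting the mean everywhere gives 0.0.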
fig = plt.figure()
ax3d = fig.add_subplot(projection='3d')  # Axes3D(plt.figure()) is deprecated in newer matplotlib
ax3d.set_xlabel('Change_vaccinations')
ax3d.set_ylabel('Total_tests')
ax3d.set_zlabel('Predicted_total_cases')
ax3d.view_init(12, 225)
ax3d.scatter3D(xs=lin_model_pred_df['change_vaccinations'], ys=lin_model_pred_df['total_tests'], zs=lin_model_pred_df['total_cases'], color='green')
ax3d.scatter3D(xs=lin_model_pred_df['change_vaccinations'], ys=lin_model_pred_df['total_tests'], zs=lin_model_pred_df['predicted_total_cases'], color='red')
plt.show()
import statsmodels.formula.api as smf
tc_ml_stats_df = smf.ols("total_cases ~ total_tests + change_vaccinations", data=ontariocovid_vaccine_ts_df)
# tc_ml_stats_df = smf.ols("total_fatalities ~ total_cases + total_tests + total_recoveries", data=ontariocovid_vaccine_ts_df)
output = tc_ml_stats_df.fit()
output.summary()
dt_model = DecisionTreeRegressor(criterion='squared_error', max_depth=5, random_state=0)  # 'mse' was renamed to 'squared_error' in scikit-learn 1.0
dt_model.fit(X_train, y_train)
print(f"The decision tree regressor model score is: {dt_model.score(X_train, y_train):.4f}") # getting model score
dt_lin_pred_df = lin_model_pred_df.copy()
dt_lin_pred_df.head(2)
dt_lin_pred_df['dt_predicted_total_cases'] = np.round(dt_model.predict(X_test)).astype(int)
dt_lin_pred_df.head()
print(f"The measure of accuracy for the decision tree model using r-squared is: {r2_score(dt_lin_pred_df['total_cases'], dt_lin_pred_df['dt_predicted_total_cases']):.4f}")
print(f"The mean absolute error for the decision tree model is: {mean_absolute_error(dt_lin_pred_df['total_cases'], dt_lin_pred_df['dt_predicted_total_cases']):.4f}")
print(f"The mean squared error for the decision tree model is: {mean_squared_error(dt_lin_pred_df['total_cases'], dt_lin_pred_df['dt_predicted_total_cases']):.4f}")
from sklearn.ensemble import RandomForestRegressor
rf_model = RandomForestRegressor(max_depth=5, random_state=0).fit(X_train, y_train)
rf_model.score(X_train, y_train)
rf_lin_pred_df = dt_lin_pred_df.copy()
rf_lin_pred_df.head(2)
rf_lin_pred_df['rf_predicted_total_cases'] = np.round(rf_model.predict(X_test)).astype(int)
rf_lin_pred_df.tail()
print(f"The measure of accuracy for the random forest regression model using r-squared is: {r2_score(rf_lin_pred_df['total_cases'], rf_lin_pred_df['rf_predicted_total_cases']):.4f}")
print(f"The mean absolute error for the random forest regression model is: {mean_absolute_error(rf_lin_pred_df['total_cases'], rf_lin_pred_df['rf_predicted_total_cases']):.4f}")
print(f"The mean squared error for the random forest regression model is: {mean_squared_error(rf_lin_pred_df['total_cases'], rf_lin_pred_df['rf_predicted_total_cases']):.4f}")
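The three regressors above can also be compared side by side in a single loop over (name, estimator) pairs. The sketch below uses synthetic data from `make_regression` as a stand-in for the Ontario features, so the printed scores will not match the ones reported in this project:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# Synthetic stand-in for the two Ontario features (change_vaccinations, total_tests)
X, y = make_regression(n_samples=400, n_features=2, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(max_depth=5, random_state=0),
    "Random Forest": RandomForestRegressor(max_depth=5, random_state=0),
}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    print(f"{name}: R2={r2_score(y_test, pred):.4f}  "
          f"MAE={mean_absolute_error(y_test, pred):.2f}  "
          f"MSE={mean_squared_error(y_test, pred):.2f}")
```

Looping like this keeps the evaluation code in one place and makes it harder for a metric call to drift out of sync between models, as happened in the per-model print blocks above.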
Observation:

- Regression models were used to predict the daily total number of cases in Ontario, as the target variable is continuous.
- Linear regression was chosen for modelling because strong linear relationships (correlations approaching 1) were identified between total cases and other features in the data.
- The model was trained on total tests and changes in daily partial-vaccination activity in the province. These features were selected because, in addition to their linear relationship with total cases, lower multicollinearity was observed between them.
- For the models created with train_test_split data, a test_size of 30% was used, leaving 70% of the data for training.
- The linear regression model had an approximate training accuracy score of 0.982. After fitting, the model predicted total cases from the test features with an R-squared of approximately 0.9815.
- Validating the model with statsmodels.formula.api, an R-squared of approximately 0.982 was also obtained, with coefficient confidence intervals reported at the default 95% level.
- Based on the linear regression model, each additional daily partial vaccination is associated with an increase of approximately 0.799 in total cases, and each additional test with an increase of approximately 0.025, holding the other variable constant.
- Mathematically:
  total_cases = -11860.454 + (0.799 * change_vaccinations) + (0.025 * total_tests)
- To validate the model, a decision tree regressor was also explored. It showed better predictive ability, with an R-squared of 0.993 and lower mean errors (MAE and MSE) on the test data than linear regression.
- A further attempt to predict daily total cases used a random forest regressor, which samples features at random and is therefore more likely to reduce bias than a single decision tree, which uses all the features. This model showed a higher accuracy score and lower mean errors than the previous models (recommended).
- We reject the null hypothesis, as there is sufficient statistical evidence against it.
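The fitted equation can be checked by hand: plugging a day's feature values into the reported intercept and coefficients reproduces the model's prediction. The feature values below are hypothetical, chosen only to illustrate the arithmetic:

```python
# Reproducing a linear-regression prediction by hand, using the intercept and
# coefficients reported above. The feature values are hypothetical.
intercept = -11860.454
coef_vaccinations = 0.799
coef_tests = 0.025

change_vaccinations = 50_000   # hypothetical daily partial vaccinations
total_tests = 16_000_000       # hypothetical cumulative test count

total_cases = (intercept
               + coef_vaccinations * change_vaccinations
               + coef_tests * total_tests)
print(f"Predicted total_cases: {total_cases:.1f}")  # → Predicted total_cases: 428089.5
```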
gridspec.GridSpec(1,3)
plt.figure(figsize=(19,6.5))
plt.subplot2grid((1,3),(0,0))
sns.scatterplot(x= lin_model_pred_df.index, y=lin_model_pred_df['total_cases'], color='red', label='Actual_total_cases')
lin_model_pred_df['predicted_total_cases'].plot(color='green')
plt.title("Linear Reg. Prediction (Train/Test split)", fontdict={'fontweight':'bold'})
plt.legend()
plt.subplot2grid((1,3),(0,1))
sns.scatterplot(x= rf_lin_pred_df.index, y=rf_lin_pred_df['total_cases'], color='red', label='Actual_total_cases')
rf_lin_pred_df['dt_predicted_total_cases'].plot(color='green')
plt.title("Decision Tree Regressor Prediction (Train/Test split)", fontdict={'fontweight':'bold'})
plt.legend()
plt.subplot2grid((1,3),(0,2))
sns.scatterplot(x= rf_lin_pred_df.index, y=rf_lin_pred_df['total_cases'], color='red', label='Actual_total_cases')
rf_lin_pred_df['rf_predicted_total_cases'].plot(color='green')
plt.title("Random Forest Regressor Prediction (Train/Test split)", fontdict={'fontweight':'bold'})
plt.legend()
plt.show()
# First, initialize KFold
kf_select = KFold(n_splits=5, shuffle = True, random_state=1)
kf_df = pd.concat([features, target_prediction], axis=1)
kf_df
# KFold.split yields (train_index, test_index) pairs; unpack all 5 splits at once
set_1, set_2, set_3, set_4, set_5 = kf_select.split(kf_df)
set_1
# train
kf_X_train_1 = kf_df.iloc[set_1[0], :-1]
kf_y_train_1 = kf_df.iloc[set_1[0], -1]
# test
kf_X_test_1 = kf_df.iloc[set_1[1], :-1]
kf_y_test_1 = kf_df.iloc[set_1[1], -1]
kf_test_df = pd.concat([kf_X_test_1, kf_y_test_1], axis=1)
result = model_tc.fit(kf_X_train_1, kf_y_train_1)
print(f"The accuracy score for set 1: {result.score(kf_X_train_1, kf_y_train_1)}\n\n")
print(f"The intercept for set 1 (lin_reg.): {result.intercept_}\n\n")
print(f"The slopes for set 1 are: \nchange_vaccinations: {result.coef_[0]} \ntotal_tests: {result.coef_[1]}")
kf_test_df['pred_lin_set1'] = np.round(result.predict(kf_X_test_1)).astype(int)
kf_test_df.tail(7)
# train data:
kf_X_train_2 = kf_df.iloc[set_2[0], :-1]
kf_y_train_2 = kf_df.iloc[set_2[0], -1]
# test data:
kf_X_test_2 = kf_df.iloc[set_2[1], :-1]
# training the model and measuring accuracy
result_2 = model_tc.fit(kf_X_train_2, kf_y_train_2)
print(f"The model score is: {result_2.score(kf_X_train_2, kf_y_train_2)}")
# adding predicted total cases as a column to kf_test_df
kf_test_df['pred_lin_set2'] = np.round(result_2.predict(kf_X_test_2)).astype(int)
kf_test_df.tail(7)
# train data:
kf_X_train_3 = kf_df.iloc[set_3[0], :-1]
kf_y_train_3 = kf_df.iloc[set_3[0], -1]
# test data:
kf_X_test_3 = kf_df.iloc[set_3[1], :-1]
# training the model and measuring accuracy
result_3 = model_tc.fit(kf_X_train_3, kf_y_train_3)
print(f"The model score is: {result_3.score(kf_X_train_3, kf_y_train_3)}")
# adding predicted total cases as a column to kf_test_df
kf_test_df['pred_lin_set3'] = np.round(result_3.predict(kf_X_test_3)).astype(int)
kf_test_df.tail(7)
# train data:
kf_X_train_4 = kf_df.iloc[set_4[0], :-1]
kf_y_train_4 = kf_df.iloc[set_4[0], -1]
# test data:
kf_X_test_4 = kf_df.iloc[set_4[1], :-1]
# training the model and measuring accuracy
result_4 = model_tc.fit(kf_X_train_4, kf_y_train_4)
print(f"The model score is: {result_4.score(kf_X_train_4, kf_y_train_4)}")
# adding predicted total cases as a column to kf_test_df
kf_test_df['pred_lin_set4'] = np.round(result_4.predict(kf_X_test_4)).astype(int)
kf_test_df.tail(7)
# train data:
kf_X_train_5 = kf_df.iloc[set_5[0], :-1]
kf_y_train_5 = kf_df.iloc[set_5[0], -1]
# test data:
kf_X_test_5 = kf_df.iloc[set_5[1], :-1]
# training the model and measuring accuracy
result_5 = model_tc.fit(kf_X_train_5, kf_y_train_5)
print(f"The model score is: {result_5.score(kf_X_train_5, kf_y_train_5)}")
# adding predicted total cases as a column to kf_test_df
kf_test_df['pred_lin_set5'] = np.round(result_5.predict(kf_X_test_5)).astype(int)
kf_test_df.tail(7)
print("KFold cross validation of Linear Regression model with 5 shuffled splits of the data: \n\n")
# set 1:
print(f"For set 1, the measure of accuracy for the model using r-squared is: {r2_score(kf_test_df['total_cases'], kf_test_df['pred_lin_set1']):.4f}")
print(f"For set 1, the mean absolute error for the model is: {mean_absolute_error(kf_test_df['total_cases'], kf_test_df['pred_lin_set1']):.2f}")
print(f"For set 1, the mean squared error for the model is: {mean_squared_error(kf_test_df['total_cases'], kf_test_df['pred_lin_set1']):.2f} \n\n")
# set 2:
print(f"For set 2, the measure of accuracy for the model using r-squared is: {r2_score(kf_test_df['total_cases'], kf_test_df['pred_lin_set2']):.4f}")
print(f"For set 2, the mean absolute error for the model is: {mean_absolute_error(kf_test_df['total_cases'], kf_test_df['pred_lin_set2']):.2f}")
print(f"For set 2, the mean squared error for the model is: {mean_squared_error(kf_test_df['total_cases'], kf_test_df['pred_lin_set2']):.2f} \n\n")
# set 3:
print(f"For set 3, the measure of accuracy for the model using r-squared is: {r2_score(kf_test_df['total_cases'], kf_test_df['pred_lin_set3']):.4f}")
print(f"For set 3, the mean absolute error for the model is: {mean_absolute_error(kf_test_df['total_cases'], kf_test_df['pred_lin_set3']):.2f}")
print(f"For set 3, the mean squared error for the model is: {mean_squared_error(kf_test_df['total_cases'], kf_test_df['pred_lin_set3']):.2f} \n\n")
# set 4:
print(f"For set 4, the measure of accuracy for the model using r-squared is: {r2_score(kf_test_df['total_cases'], kf_test_df['pred_lin_set4']):.4f}")
print(f"For set 4, the mean absolute error for the model is: {mean_absolute_error(kf_test_df['total_cases'], kf_test_df['pred_lin_set4']):.2f}")
print(f"For set 4, the mean squared error for the model is: {mean_squared_error(kf_test_df['total_cases'], kf_test_df['pred_lin_set4']):.2f} \n\n")
# set 5:
print(f"For set 5, the measure of accuracy for the model using r-squared is: {r2_score(kf_test_df['total_cases'], kf_test_df['pred_lin_set5']):.4f}")
print(f"For set 5, the mean absolute error for the model is: {mean_absolute_error(kf_test_df['total_cases'], kf_test_df['pred_lin_set5']):.2f}")
print(f"For set 5, the mean squared error for the model is: {mean_squared_error(kf_test_df['total_cases'], kf_test_df['pred_lin_set5']):.2f} \n\n")
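The five manual splits above can be collapsed into a single call with `cross_val_score`, which performs the per-fold fitting and scoring internally. The sketch below uses synthetic data as a stand-in for `kf_df`, so the fold scores will not match the ones printed above:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the (change_vaccinations, total_tests) feature matrix
X, y = make_regression(n_samples=300, n_features=2, noise=5.0, random_state=1)

kf = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(LinearRegression(), X, y, cv=kf, scoring="r2")
for i, s in enumerate(scores, start=1):
    print(f"Fold {i}: R-squared = {s:.4f}")
print(f"Mean R-squared across folds: {scores.mean():.4f}")
```

Besides being shorter, this avoids the subtle leakage of refitting the same `model_tc` object across sets, since each fold gets a fresh fit.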
Observation:

- set_1, derived from the KFold split, made better predictions, with a higher accuracy score than the other sets from the split.
- Mathematically, for set_1:
  total_cases = -11660.203 + (0.860 * change_vaccinations) + (0.025 * total_tests)
- X and y from set_1 are used in the other models below.
kf_dt_result = dt_model.fit(kf_X_train_1, kf_y_train_1)
print(f"The model score is: {kf_dt_result.score(kf_X_train_1, kf_y_train_1)}")
# adding predicted total cases as a column to kf_test_df
kf_test_df['dt_pred_set1'] = np.round(kf_dt_result.predict(kf_X_test_1)).astype(int)
kf_test_df.tail()
kf_rf_result = rf_model.fit(kf_X_train_1, kf_y_train_1)
print(f"The model score is: {kf_rf_result.score(kf_X_train_1, kf_y_train_1)}")
# adding predicted total cases as a column to kf_test_df
kf_test_df['rf_pred_set1'] = np.round(kf_rf_result.predict(kf_X_test_1)).astype(int)
kf_test_df.tail(10)
print("Linear Regression:\n")
print(f"For set 1, the measure of accuracy for the model using r-squared is: {r2_score(kf_test_df['total_cases'], kf_test_df['pred_lin_set1']):.4f}")
print(f"For set 1, the mean absolute error for the model is: {mean_absolute_error(kf_test_df['total_cases'], kf_test_df['pred_lin_set1']):.2f}")
print(f"For set 1, the mean squared error for the model is: {mean_squared_error(kf_test_df['total_cases'], kf_test_df['pred_lin_set1']):.2f} \n\n")
print("Decision Tree Regressor:\n")
print(f"For set 1, the measure of accuracy for the model using r-squared is: {r2_score(kf_test_df['total_cases'], kf_test_df['dt_pred_set1']):.4f}")
print(f"For set 1, the mean absolute error for the model is: {mean_absolute_error(kf_test_df['total_cases'], kf_test_df['dt_pred_set1']):.2f}")
print(f"For set 1, the mean squared error for the model is: {mean_squared_error(kf_test_df['total_cases'], kf_test_df['dt_pred_set1']):.2f} \n\n")
print("Random Forest Regressor:\n")
print(f"For set 1, the measure of accuracy for the model using r-squared is: {r2_score(kf_test_df['total_cases'], kf_test_df['rf_pred_set1']):.4f}")
print(f"For set 1, the mean absolute error for the model is: {mean_absolute_error(kf_test_df['total_cases'], kf_test_df['rf_pred_set1']):.2f}")
print(f"For set 1, the mean squared error for the model is: {mean_squared_error(kf_test_df['total_cases'], kf_test_df['rf_pred_set1']):.2f} \n\n")
Observation:

- A 5-set KFold split was used to derive 5 distinct arrangements of a dataframe containing the prediction target and predictors.
- Tested with a linear regression model, set_1, derived from the split, made the best predictions of the 5.
- Mathematically, for set_1:
  total_cases = -11660.203 + (0.860 * change_vaccinations) + (0.025 * total_tests)
- The other models were explored using only X and y from set_1.
- The linear regression model has an accuracy score of approximately 0.9771.
- With the decision tree regressor, the prediction accuracy (R-squared) improved over linear regression to ~0.9991.
- This improved further with the random forest regressor, where the R-squared was computed as ~0.9998.
- For both the decision tree and random forest regressors, the error metrics on set_1 are lower than those observed with the train_test_split data.
- Overall, the random forest model made predictions with the least error across the 3 models and is recommended.
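Since the random forest is the recommended model, its `feature_importances_` attribute offers a quick check of how much each predictor contributed. The sketch below uses synthetic data; the feature names are those used in this project, but the importances printed are illustrative only:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

feature_names = ["change_vaccinations", "total_tests"]  # predictors used in this project

# Synthetic stand-in for the project's feature matrix and target
X, y = make_regression(n_samples=300, n_features=2, noise=5.0, random_state=0)

rf = RandomForestRegressor(max_depth=5, random_state=0).fit(X, y)
for name, imp in zip(feature_names, rf.feature_importances_):
    print(f"{name}: {imp:.3f}")  # importances sum to 1 across features
```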
gridspec.GridSpec(1,3)
plt.figure(figsize=(19,6.5))
plt.subplot2grid((1,3),(0,0))
sns.scatterplot(x= kf_test_df.index, y=kf_test_df['total_cases'], color='red', label='Actual_total_cases')
kf_test_df['pred_lin_set1'].plot(color='green')
plt.title("Linear Reg. Prediction on set_1", fontdict={'fontweight':'bold'})
plt.legend()
plt.subplot2grid((1,3),(0,1))
sns.scatterplot(x= kf_test_df.index, y=kf_test_df['total_cases'], color='red', label='Actual_total_cases')
kf_test_df['dt_pred_set1'].plot(color='green')
plt.title("Decision Tree Regressor Prediction on set_1", fontdict={'fontweight':'bold'})
plt.legend()
plt.subplot2grid((1,3),(0,2))
sns.scatterplot(x= kf_test_df.index, y=kf_test_df['total_cases'], color='red', label='Actual_total_cases')
kf_test_df['rf_pred_set1'].plot(color='green')
plt.title("Random Forest Regressor Prediction on set_1", fontdict={'fontweight':'bold'})
plt.legend()
plt.show()